Structural Determinant of Protein Designability 
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Here we present an approximate analytical theory for the relationship between a protein structure's 
contact matrix and the shape of its energy spectrum in amino acid sequence space. We demonstrate 
a dependence of the number of sequences of low energy in a structure on the eigenvalues of the 
structure's contact matrix, and then use a Monte Carlo simulation to test the applicability of this 
analytical result to cubic lattice proteins. We find that the lattice structures with the most low- 
energy sequences are the same as those predicted by the theory. We argue that, given sufficiently 
strict requirements for foldability, these structures are the most designable, and we propose a simple 
means to test whether the results in this paper hold true for real proteins. 

PACS numbers: 



Successful protein design relies in part on knowledge 
of how a polypeptide chain's native structure varies as 
a function of its amino acid sequence. Yet even if this 
"protein folding problem" were solved in its entirety, the 
would-be protein designer would still face the formidable 
task of finding those sequences which folded to the target 
structure he or she wished to engineer 0,0. It is there- 
fore vital to know how many solutions to this search of 
sequence space exist for a given target structure, i.e. how 
designable the target structure is @, H S IE IE The 
question of what makes a particular protein structure 
more designable also bears fundamentally on our under- 
standing of how proteins first evolved. 

Past studies of designability have been limited in large 
part by their lack of generality. The previous contribu- 
tions in 0, for example, rely heavily on their study 
of two-letter monomer alphabets and Cartesian lattice- 
polymers, and can therefore offer no clear implications 
for the 20- letter, off-lattice world of real proteins 0. In 
contrast, studies like that of Koehl and Levitt come 
closest to probing real protein designabilities, but have 
no theoretical foundations from which to extrapolate be- 
yond their numerical results. Finally, a number of in- 
vestigations have assumed that the distribution of amino 
acid sequence energies is either nearly or totally in- 
dependent of the structural topology of the target fold, a 
premise which is flatly contradicted by the findings in 5] . 
The intention of this communication is therefore to iden- 
tify a theoretically motivated, generally applicable quan- 
titative measure of structural topology which we expect 
to be a good predictor of designability. 

In this Letter, we develop an approximate analyti- 
cal theory of the spectrum of possible monomer sequence 
energies for a heteropolymer in any given conformation. 
Our theory leads us to identify a novel topological prop- 
erty of a conformation, its contact trace, as a determinant 
of the shape of the conformation's sequence spectrum. 
We then confirm the predicted usefulness of the contact 
trace by running Monte Carlo simulations under condi- 



tions which are more realistic than those for which the 
contact trace was originally identified. Finally, we use the 
results of these simulations to suggest a connection be- 
tween the contact trace and designability, and we propose 
a way to test this hypothesis on real protein structures. 

As a preliminary, we must draw a distinction between 
what we term "strong designability" and "weak des- 
ignability" . We define a structure's "strong designabil- 
ity" in the same way that previous studies have defined 
its "designability" : as the total number of amino acid 
sequences that fold to that structure. Thereotical argu- 
ments based on the random energy model suggest that 
in order for a sequence to be foldablc, its native state 
en ergy m ust lie below a certain low-energy threshold E c 
[MlE[lE ini>LL3l> which represents the lowest energy range 
accessible to misfolded conformations. We therefore de- 
fine a structure's "weak designability" to be the fraction 
of its sequences which lie below a low energy cutoff E c . 
It is important to note that though the strong and weak 
versions of designability are set apart a priori, it is pos- 
sible that they will turn out to be the same thing in 
practice. Indeed, we will argue later on that if natural 
thermodynamic and kinetic requirements for folding are 
sufficiently stringent, then a structure's strong and weak 
designability become essentially indistinguishable. 

Our first aim is to determine whether contact topol- 
ogy effects a structure's distribution of possible sequence 
energies, and to this end we derive a closed-form sequence 
partition function for a special class of amino acid alpha- 
bets. We begin by considering a polymer of N monomers, 
where each monomer can be one of 2M different possi- 
ble kinds. We may construct for any polymer config- 
uration an N x N traceless contact matrix C whereby 
lattice nearest neighbors that are not sequence neigbors 
are considered to be in contact [lflj . Next, we represent 
the amino acid type of the ith monomer in the chain as a 
2M dimensional unit- vector s"> — [0, . . . , 0, 1, 0, . . . , 0], 
where the non-zero vector element is the fcth element of 
the vector if the ith monomer is of type k. 



We define the Hamiltonian to be a standard nearest- 
neighbor contact potential, i.e. 

N,N 
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where B is the 2M x 2M matrix of interaction energies 
for the different pairs of monomer types. A closed form 
expression for the partition function of this Hamiltonian 
may be obtained if we let B take on the special form 
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(2) 



In other words, B is the direct product of a matrix V 
with an Ising-like potential. Though we have now con- 
strained the nature of our contact potential, we still re- 
tain unlimited freedom to choose the types of interactions 
represented in V. 

Now let us define the M-vector a through tJk+i = 
S2fe+i — S2k+2- Our Hamiltonian becomes 



N,N 

id 
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(3) 



where a^ 1 ' is a vector of unit length whose single non-zero 
element may be either 1 or — 1. 

At this point we define the MN x MN block matrix U 
through U 



M(i-l)+k,M(j-l)+l 



Ci t jVk,i- 



In other words, 



we turn every element of C into an M x M block which 
couples the vectors {a W} through V wherever there is 



a contact. Finally, if we write a = [a 
the Hamiltonian can be expressed as 



H = 



1 



(U<x) 



W] then 



(4) 



This form of the Hamiltonian will allow us to calculate 
the sequence space partition function, for any contact 
map and any potential matrix V. To do so, we em- 
ploy a continuous-spin approximation, allowing each vec- 
tor a to swing anywhere on the M-dimensional unit 
sphere instead of restricting it to one of the 2M available 
unit-integer lattice points. This approximation not only 
smears the discrete sequence spectrum into a continuous 
one, but also distorts the spectral width by altering the 
relative sizes of M and M — 1 letter sequence spaces, 
thereby preventing the theory from making quantita- 
tively accurate predictions. Our assumption is that the 
model nevertheless retains important information about 
the effect that variations in contact topology have on the 
shape of the spectrum. The partition function now be- 



comes 



Z{[3) = fd MN a (f[S 
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Assuming hereafter that M ^> 1, we have 



d M aexp 



d\a\6(\a\-l) (6) 



Defining M — (M - 1)1, where I is the MN x MN 
identity matrix, and u = M _1 U we may write 



Z(/3) = J d MN aexp 



-~a-(M + f3V)a 



(T) 



and, since U is a real symmetric matrix, we finally obtain 



m 

Z(0) 
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(8) 



If we use det A = exp Tr In A to expand In z to 0(/3 2 ) and 
inverse-Laplace transform the resulting approximation of 
the partition function to obtain a distribution of sequence 
energies n(E), we get 



n(E) 



1 

2^ 



^ E z(i[3)d(3 : 



exp 



E 2 



exp 



Tr u 2 

E 2 



2N c a 2 B 



(9) 



where Nq is the number of contacts and <j 2 b is the vari- 
ance of the interaction energies in B. A similar result 
was obtained in [2j by a straightforward expansion of 
the sequence partition function. Equation says that 
the Gaussian approximation of the density of states leads 
to a naive sequence space random energy model (REM) 
for n(E). It is this REM picture, in which all struc- 
tures with the same number of contacts have identical 
sequence spectra, which has been used implicitly in |(J. 
We demonstrate below, however, that the consideration 
of higher order terms in the expansion of In z can have a 
profound impact on how we understand designability. 

Let us now consider the free energy F = -jjlnz of 
our sequence space partition sum. Defining the matrix 
v = M 1 _ 1 V and expanding about high "design tempera- 
ture" (i.e. requiring that|/3Aj| < 1, for all Aj which are 
eigenvalues of u), we get 



F = -?(Tr v 2 )(Tt C 2 ) + ^(Tr « 3 )(Tr C 3 ) 
4 6 



(10) 



-(Trw 4 )(TrC 4 ) + 0(/3 4 ) 



Those structures which minimize F will be the ones with 
the greatest number of amino acid sequences that have 
low energy when threaded onto that structure. Mini- 
mization of F is therefore a means to maximize weak 
designability, which we recall is determined by the frac- 
tion of low-energy sequences in a structure's sequence 
spectrum. Thus, we now consider which choices of the 
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FIG. 1: The difference in sequence space entropy between 
an energy near the peak of all structural sequence spectra 
(E = —2) and one in the lower tail of all spectra (E — — 8) 
as a function of the contact trace, measured here by the 
largest eigenvalue of the structure's contact matrix (which 
follows from Tr C n = A"). Each point was generated from 
data collected while slowly annealing a Monte Carlo sequence 
design simulation from high temperature (T — 2) to low 
(T = .2), with 10 7 Monte Carlo steps taken at each tempera- 
ture. The boxed points correspond to structures which were 
chosen by hand so as to ensure that the extrema of the range 
of possible eigenvalues were represented. All other structures 
were chosen randomly. 
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FIG. 2: The change in sequence space entropy from energy 
(E)(T = do) = E to energy E for three structures with largest 
contact matrix eigenvalues of 3.02 (high trace), 2.78 (interme- 
diate trace), and 2.60 (low trace). 



contact matrix C would serve to make the free energy 
F as negative as possible. Because Tr C n is equal to 
the number of n-step paths along the contact map which 
return to their starting place, we know that all such con- 
tact traces must be positive. Thus, the exact behavior 
of the series in IjlOjl will hinge on whether the largest 
eigenvalues of v are positive or negative. 

For either type of potential matrix v, however, we ex- 
pect that there will be some positive correlation between 



the trace of an even power of a structure's contact ma- 
trix and the number of low energy monomer sequences in 
that structure. Furthermore, the dependence of the free 
energy expansion in (fTU|l on such coarse quantities as the 
traces of powers of v suggests that the impact of the con- 
tact matrix on the spectrum of sequence energies should 
be relatively insensitive to the detailed features of the 
potential. We therefore determined to empirically test 
whether the above results remained valid for a discrete 
monomer alphabet which violated the special form of the 
potential assumed in J2J) . We first calculated the contact 
matrices for all 103346 different compact conformations 
of 27-mers on a cubic lattice 01- Next, we calculated (E) 
vs. T annealing curves for random starting sequences on 
different structures for a standard Monte Carlo search of 
sequence space with a move set containing composition- 
preserving two-monomer and three-monomer permuta- 
tions. The energy of each sequence was determined using 
a potential set given by Table 6 of 0] . This set of inter- 
actions, where average interactions are subtracted out, 
is one of the most diverse potentials possible for a 20- 
letter alphabet, and therefore provides the most general 
empirical test of the predicted relationship between se- 
quence energies and contact topology. From the anneal- 
ing curves, we then calculated the entropy in sequence 
space S(E) according to the prescription given by eq. 

(ii)of0. 

Figure ^ plots the sequence space entropy differ- 
ence between low and near-modal energy versus the 
largest eigenvalue of the structure's contact matrix for 86 
randomly-selected lattice structures. As predicted, the 
entropy difference between the peak and the left tail de- 
creased as the largest contact matrix eigenvalue increased 
(corr. = —0.92), indicating that more sequences have low 
energy in high trace structures. Figure [21 illustrates that 
the effect observed in Fig. ^ results from global differ- 
ences in the shapes of the sequence spectra of high trace 
and low trace structures. The higher the contact trace, 
the more gradually the number of sequences falls off as 
energy decreases, and therefore the greater the relative 
number of sequences of low energy. Clearly, the contact 
trace of the target structure controls how low in energy a 
Monte Carlo sequence optimization algorithm running at 
fixed temperature Td es will be able to search. The greater 
the contact trace, the larger the S(E) at low energies, i.e. 
the greater the weak designability. 

Interestingly, the most designable 27-mer structures 
identified using our maximum eigenvalue determinant 
are similar to the one identified in 01 using random 
sampling of sequences and a different, " solvation-like" 
Miyazawa- Jernigan potential. This attests to a general- 
ity of our proposed structural determinant of designabil- 
ity with respect to potentials. 

Structures of high contact trace are weakly des- 
ignable, but are they strongly designable? In order to 
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FIG. 3: We designed sequences on two different target struc- 
tures in Monte Carlo searches of 2 x 10 r steps (.2 < T < .4), 
sampled at every one sequence in 10 4 . Here, we plot the num- 
ber of sequences whose unique compact ground states were the 
structure on which they were designed, as a function of the 
difference in energy between the sequence's ground state and 
the lowest-lying excited state which shares fewer than 25% of 
its contacts with the ground state. 



We have presented an analytical theory which iden- 
tifies the first instance of a generally applicable, well- 
defined, numerical measure of a protein structure's topol- 
ogy which is expected to correlate with the structure's 
designability. Using a Monte-Carlo search in the se- 
quence space of a lattice model with 20-letter energetics, 
we have shown that the implications of the theory ex- 
tend beyond the special assumptions under which they 
were first developed. The finding that higher contact 
trace may lead to greater potential for thermal stabil- 
ity leads us to hypothesize that thermophilic organisms, 
whose natural environment makes uncommonly strict de- 
mands for protein stability, exhibit a genomic bias to- 
wards folds of higher contact trace. We have recently 
found that this bias exists (unpublished results), provid- 
ing further encouragement that the contact trace may 
yet offer new insight into the laws governing structural 
diversity in natural proteins. 

We thank Edo Kussel, William Chen, Nikolay 
Dokholyan, and Lewyn Li for valuable discussions, and 
NIH and Pfizer Inc. for support. 



address this question, we examined the stability of se- 
quences designed on two structures of maximal and min- 
imal contact trace. For each target structure, we de- 
termined how many of its designed sequences were "on- 
target" , that is, had the target structure as their unique 
energetic ground state determined over all compact con- 
formations, and calculated the gap in energy between 
that of the ground state and those of the lowest-energy 
structures with low structural similarity to the ground 
state conformation. We found that while only 12% of 
the sequences designed for the low trace structure were 
on-target, 24% were in the case of the high trace struc- 
ture. Furthermore, as Figure |31 illustrates, the sequences 
designed successfully on the high trace structure tended 
on average to have larger energy gaps than their low trace 
cousins, consistent with earlier observations that low en- 
ergy in the native conformation also leads on-average to 
a larger gap and greater stability 0, 0] 

Fig. therefore gives us a means to unify strong and 
weak designability. Past studies have suggested that in 
order for a protein sequence to fold rapidly to its na- 
tive state, it must have a larger-than-average energy gap 
|l2Lll8lll9l |. Fig.[3]demonstrates that if the conditions for 
folding stably and quickly in nature demand sufficiently 
high energy gaps, weak designability will be one and the 
same with strong designability, since only by having very 
low energy in its target structure will a sequence have an 
appreciable chance of exhibiting the gaps which are ther- 
modynamically and kinetically required by the natural 
environment. We speculate that protein evolution under 
such conditions would lead to the emergence of natural 
protein folds with near-optimal contact traces. 
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